1  Problem  Statement

This  effort  involved  development  of  an  information  theoretic  viewpoint  for  the  model-based 
vision  problem.  This  viewpoint  brings  the  problems  of  recognition  and  pose  determination 
into  the  classical  framework  of  multidimensional  signal  detection  and  demodulation  theory.  It 
generates  a  geometric  interpretation  which  allows  an  intuitive  understanding  of  the  behavior 
of ATR systems, thereby providing insight into the tradeoffs that determine ultimate performance,
and suggesting means for optimization. This viewpoint also illustrates why ATR
systems that perform well in some cases fail in others, and subsumes the problem of sensor data
fusion within the same paradigm. Thus, it potentially constitutes a unified framework for
the  analysis,  design,  evaluation,  and  comparison  of  fusion-based  object  recognition  systems. 

The  initial  goal  of  this  investigation  was  the  development  of  a  unified  framework  which 
exploits the parallels between classical communication/information theory and the model-based
ATR problem. This initial effort focused on model characterization and ultimate
system performance bounds, in addition to development of the analogy between the communication
problem and the ATR problem. An additional goal was the characterization
of performance tradeoffs between system complexity, system performance, and system failure
modes for multi-sensor ATR systems, and development of means for evaluation and
comparison  of  fusion-based  ATR  systems. 

It  is  anticipated  that  this  will  lead  to  development  of  guidelines  for  design  and  analysis 
of  fusion-based  ATR  systems.  Thus,  future  work  could  include  investigation  into  exploiting 
system  performance  tradeoffs  in  order  to  match  ATR  techniques  to  specific  sensor  modalities 
and  applications,  as  well  as  development  of  model  construction  and  processing  techniques 
for  system  performance  optimization. 

2  Summary  of  Results

In  the  course  of  this  project  the  investigators  have  made  significant  progress  with  respect  to 
two  directions  concomitant  to  the  stated  goals  of  the  project: 

•  investigation  of  information  theoretic  performance  bounds  for  ATR  performance; 

•  design of ATR systems based on a direct consideration of information theoretic measures.

The  following  subsections  summarize  the  more  significant  results  in  these  two  directions. 
Subsequent  sections  provide  further  detail  with  respect  to  work  conducted  since  the  annual 
report  of  January  1997. 

2.1  Information  Theoretic  ATR  Performance  Bounds 

2.1.1  Motivation 

The  development  of  Automatic  Target  Recognition  (ATR)  systems  has  had  a  long  history, 
which  has  seen  successive  generations  of  better  performing  but  increasingly  more  complex 
systems. However, little is known about the ultimate performance that could yet be achieved
by  ATR  systems  regardless  of  the  type  or  complexity  of  the  algorithms  used. 

Information theory, developed in the 1940s by Claude Shannon, has two primary goals,
both within the context of communicating a given information source over a channel using
coding schemes from a given class [5]. The first goal is the discovery of the fundamental
theoretical  limits  of  achievable  performance.  The  second  goal  is  the  development  of  coding 
schemes  that  provide  performance  that  is  reasonably  close  to  the  optimal  performance  given 
by  the  theory. 

The great success of information theory at describing performance bounds, in terms of
channel characteristics, that hold for any communications system, together with the fundamental
relation between information theory and hypothesis testing (for example, likelihood
ratio tests, Stein’s lemma, and the Chernoff bound can all be expressed in terms of relative
entropy [1]), motivates our application of information theory to the investigation of
algorithm-independent, theoretical upper bounds on the performance of ATR systems.

In  order  to  investigate  the  possibility  of  developing  bounds  for  the  absolute  performance 
of  ATR  systems,  we  researched  existing  information  theoretic  bounds  in  an  attempt  to  extend 
their  use  to  the  ATR  problem.  Extensive  theoretical  and  experimental  work  was  done  to 
apply  and  extend  work  done  by  Blahut  and  Fano  to  determine  not  only  the  usefulness  of  the 
bounds  but  also  the  computational  implications  of  developing  ATR  bounds. 

We  turned  to  work  which  had  been  done  by  Blahut  to  determine  asymptotic  bounds 
for the performance of binary hypothesis testers, which was based on channel coding theory.
Blahut used this analogy between hypothesis testing and channel coding to develop an
information  theoretic  framework  within  which  both  upper  and  lower  performance  bounds 
could  be  developed.  We  were  able  to  apply  this  idea  to  the  ATR  problem  to  develop  both 
upper  and  lower  bounds  for  ATR  performance.  These  bounds  were  tested  numerically,  and 
compared with true receiver operating characteristic (ROC) curves developed using a
Neyman-Pearson test. The resulting bounds were not tight enough, compared with an actual
system ROC, to provide useful system characterization.

We  then  investigated  application  of  Fano’s  inequality,  the  principal  inequality  underlying 
many  of  the  information  theoretic  bounds  which  have  been  developed  for  communications 
applications.  The  usual  application  of  Fano’s  inequality  results  in  a  bound  on  total  error 
probability. In the context of ATR, however, the Neyman-Pearson viewpoint of detection
probability versus false alarm probability is much more useful, since it is typically not
possible to specify absolute a priori probabilities of the hypotheses or the costs associated
with each type of error. We have produced an extension of the Fano bound which generates
a  bound  on  the  Neyman-Pearson  ROC  curve  as  the  envelope  of  a  family  of  straight-line 
bounds.  This  extension  produces  a  tighter  upper  bound  than  the  Blahut  bound  on  ATR 
system  performance,  as  assessed  by  our  numerical  tests. 

It  is  noteworthy  that  evaluation  of  these  bounds  requires  as  much  information  about  a 
system and as much computation as is needed to produce the actual ROC curve itself. Thus,
these bounds are no more computationally practical for a complex
ATR scenario than direct computation of the ROC.

2.1.2  Information  Theoretic  ATR  System  Design 

Our  other  efforts  have  focused  on  the  design  of  an  ATR  system  architecture  which  is  directly 
motivated  by  an  information  theoretic  posing  of  the  ATR  problem.  Using  the  geometric 
image  space  model  for  representation  of  target  instantiations  which  was  presented  in  the 
proposal  for  this  work,  we  have  developed  an  Information  Theoretic  Decision  Tree  (ITDT) 
ATR  procedure.  This  approach  frames  ATR  as  a  sequence  of  decisions  leading  to  object 
identification  and  pose  estimation  in  which  the  number  of  binary  decisions  is  minimized  by 
maximizing  the  information  gained  by  each  decision.  These  decisions  are  framed  in  the  native 
image  space  defined  by  the  raw  object  image  data,  so  as  to  not  sacrifice  information  due  to 
artificially  imposed  lossy  data  compression,  as  is  the  case  with  any  reduced-data  approach 
to  ATR. 

Since  each  decision  in  the  ITDT  process  seeks  to  maximize  the  information  gained  by 
that  decision  in  the  sense  of  discrimination  between  object  instantiations,  establishment  of 
the  decision  regions  is  an  NP-complete  global  optimization  problem.  However,  we  have 
developed  a  local  optimization  approach  based  on  gradient  descent  techniques  in  which 
the  non-optimality  of  the  decision  sacrifices  only  ultimate  speed  of  processing,  and  not  the 
ultimate  ATR  performance. 

We have tested this system with respect to recognition of targets with azimuthal freedom,
from SAR imagery, embedded in clutter. The results, shown in the sequel, indicate
performance levels well above those of other ATR systems for pose-variable targets currently
under study at WPI, with significantly less processing.

These encouraging preliminary results provide evidence that information theoretic considerations
can be used as the basis for ATR system design.

2.1.3  Hierarchical  Decision  Tree  based  ATR

Hierarchical  search  of  a  database  of  target  exemplars  offers  the  possibility  of  vast  decreases 
in  processing  time  for  a  given  ATR  performance  since: 

•  Hierarchical  search  for  data  entities  that  can  be  arranged  in  an  ordered  binary  tree  can 
yield  linear  processing  time  growth  for  exponentially  increasing  data  sets.  Hence  the 
much  desired  goal  of  linear  growth  of  ATR  processing  time  with  numbers  of  degrees  of 
target  pose  and  environment  freedom  seems  obtainable. 

Hierarchical  search  of  a  database  of  target  exemplars  has  often  been  investigated  and 
rejected  by  various  researchers  owing  to  spurious  decision  behaviors  discovered  on  extensive 
testing  because  typical  implementations  of  hierarchical  search: 

•  pre-suppose that the model imagery forms a totally ordered (sub-linearly searchable) set,
which in general it does not;

•  reduce the dimensionality of the imagery by some form of projection to obtain a searchable
set with an unavoidable accompanying loss of discrimination power;

•  apply  hash  functions  without  regard  to  the  unruly  effects  of  image  degradation  (from 
sensor  noise,  environment  and  target  variation)  on  the  hash  function  output. 

2.1.4  Information  Theoretic  Decision  Tree  based  ATR 

During the first portion of this investigation into the implications of information theory with
regard to the ultimate performance and bounding of performance for ATR, certain insights
arose  regarding  effectual  construction  of  decision  tree  based  ATR  systems: 

•  Total  ordering  can  be  imposed  upon  the  exemplar  set  by  “extension”  of  the  set  in  which 
new  set  members  are  generated  by  properly  dividing  the  original  exemplar  equivalence 
classes  so  that  a  given  exemplar  may  be  represented  by  several  equivalence  classes; 

•  the  original  image  space  should  be  divided,  and  not  a  projection  of  the  true  image 
space,  to  retain  all  relevant  information; 

•  the  division  of  the  space  should  be  driven  by  maximization  of  an  information  theoretic 
measure  (to  be  described  below)  to  minimize  the  number  of  decisions  and  hence  the 
amount  of  computation; 

•  the  penalty  for  approximation  in  the  division  of  the  decision  space  in  this  case  is 
increased  computation  but  not  loss  of  performance. 

2.2  Discussion  of  the  ITDT  Development 

Figure 1 depicts the placement of points in a 3-D decision space generated from
the intensity of three pixels in a sequence of 14 target exemplar images. Each of the points
represents the center of a density function which models the stochastic behavior of the target,
sensor and environment.

Figure 1: Depiction of a 3-D decision space occupied by points derived from image data.


When  considering  the  question  of  how  to  divide  the  problem  of  image  recognition  into  a 
set  of  simple,  hierarchical  steps,  we  intuitively  are  led  to  the  notion  of  successively  splitting 
the  space  by  a  set  of  planes.  Choosing  these  planes  to  segregate  exemplar  points  into 
approximately equal subsets yields a binary search tree that can be navigated in $\log_2(M)$
steps  given  M  exemplars.  In  later  portions  of  this  report  we  will  explore  the  mathematical 
basis  and  optimization  of  this  division  operation.  At  this  point  we  will  simply  consider  the 
overall  notion  and  its  implications. 
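
As a rough illustration of the logarithmic depth described above, the following sketch repeatedly halves a set of M exemplars and counts the resulting decision levels. It is only a counting argument under the assumption of perfectly balanced splits, not the ITDT construction itself; the function name `tree_depth` is hypothetical.

```python
import math


def tree_depth(num_exemplars: int) -> int:
    """Count the binary decisions needed when every split halves the set."""
    depth = 0
    remaining = num_exemplars
    while remaining > 1:
        # A balanced split leaves at most ceil(remaining / 2) exemplars per branch.
        remaining = math.ceil(remaining / 2)
        depth += 1
    return depth


# Compare the counted depth with ceil(log2(M)) for a few set sizes.
for m in (14, 106, 318):
    print(m, tree_depth(m), math.ceil(math.log2(m)))
```

For the 14 exemplars of the example above this gives four levels, and the count grows by one level each time the exemplar set doubles.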

Now  consider  the  placement  of  a  decision  plane  that  divides  this  space  into  regions 
containing  approximately  half  the  number  of  image  points.  The  process  of  dividing  this 
space  must  take  into  account  the  probability  density  function  that  describes  the  distribution 
of  points  that  would  be  obtained  due  to  the  stochastic  nature  of  the  target,  environment 
and  sensor  system.  For  illustrative  purposes  in  this  example,  we  will  show  the  extent  of  the 
1  sigma  radius  of  the  density  functions  as  the  surface  of  a  translucent  ball  in  the  3-space  as 
shown  in  Figure  2.  A  division  of  this  space  is  depicted  in  Figure  3. 

It  is  inevitable  that  a  given  decision  plane  will  cut  through  many  (if  not  all)  of  the 
distributions  associated  with  the  exemplar  points.  More  importantly,  we  must  consider 
that  the  exemplars  are  segregated  by  a  true  equivalence  class  structure  that  is  generated  by 
evaluation  of  the  classic  likelihood  function  associated  with  the  given  density  functions.  It 
is  the  event  in  which  a  decision  plane  cuts  through  the  boundaries  of  the  true  equivalence 
classes  that  gives  rise  to  an  irrecoverable  loss  of  ATR  performance.  That  is,  upon  segregating 
exemplars on the basis of such a division, we have placed a region of the decision space
associated  with  a  given  exemplar  on  a  side  of  the  decision  tree  that  can  never  yield  that 
exemplar  as  a  best  match. 

Figure 2: Depiction of the 3-D decision space in which the stochastic nature of the imagery is
represented by spheres of radius equal to one standard deviation of the associated probability
density.

A  means  to  circumvent  this  problem  is  to  place  such  exemplars  in  both  subsets  generated 
by the division. As a result, certain exemplars will appear in several exemplar subsets associated
with the decision tree branches, implicitly generating the equivalence class extension
needed  to  generate  a  totally  ordered  set  of  classes.  It  is  by  applying  a  simplified  scheme 
derived  from  this  notion  that  the  implementation  tested  below  demonstrated  freedom  from 
the  poor  early  decision  problem  often  found  with  hierarchical  systems. 
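
The idea of placing an exemplar in both subsets can be sketched as follows. This is a minimal illustration, not the scheme used in the tested implementation: it assumes each exemplar is a point with a spherical uncertainty region, and it duplicates any exemplar whose distance to the plane is within a chosen `margin`. The function name and the margin rule are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def split_with_overlap(
    exemplars: Sequence[Sequence[float]],
    normal: Sequence[float],
    offset: float,
    margin: float,
) -> Tuple[List[int], List[int]]:
    """Assign exemplar indices to the two sides of the plane normal . x = offset.

    An exemplar whose signed distance to the plane is within `margin`
    is placed on both sides, so neither branch can lose it.
    """
    norm = sum(w * w for w in normal) ** 0.5
    left: List[int] = []
    right: List[int] = []
    for idx, x in enumerate(exemplars):
        dist = (sum(w * xi for w, xi in zip(normal, x)) - offset) / norm
        if dist <= margin:
            left.append(idx)
        if dist >= -margin:
            right.append(idx)
    return left, right


# Four hypothetical exemplar points in a 3-D decision space.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.9, 0.0, 0.0), (4.0, 0.0, 0.0)]
left, right = split_with_overlap(points, normal=(1.0, 0.0, 0.0), offset=2.0, margin=0.5)
print(left, right)
```

The third point lies within the margin of the plane and therefore appears in both branches, at the cost of a slightly larger subtree on each side.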

2.2.1  Decision  hyperplane  placement 

In  the  more  detailed  treatment  to  follow,  it  will  be  argued  that  the  placement  of  decision 
hyperplanes  that  cut  through  the  image  space  and  divide  it  into  equivalence  regions,  and 
the  exemplars  into  subsets,  is  properly  determined  by  the  maximization  of  the  information 
measure: 

$$I(\{p_i\}; Y) = H(\{p_i\}) + H(Y) - H(\{p_i\}, Y)$$

where $\{p_i\}$ represents the joint density function (determined by the exemplars and overall
statistical system behavior) and $Y$ is the binary valued function over the image space
determined by the hyperplane position.

This  optimization  maximizes  the  information  gained  about  the  identity  of  any  point  in 
that space with respect to exemplar equivalence by means of that decision. Obviously
we  can  gain  at  most  1  bit  with  each  decision.  Gaining  one  bit  per  decision  would  allow 
determination  of  membership  of  any  image  in  an  exemplar  decision  set  in  the  minimum, 
logarithmic,  time. 
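
The information measure above can be evaluated numerically for a discrete toy problem. The sketch below assumes a finite set of exemplars with prior probabilities `prior` and, for each exemplar, a probability `p_right` that an observed image of it falls on the Y = 1 side of the hyperplane; these names and the discrete model are assumptions made for illustration, whereas the report works with density functions over the full image space.

```python
import math
from typing import Iterable, Sequence


def entropy(probs: Iterable[float]) -> float:
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def split_information(prior: Sequence[float], p_right: Sequence[float]) -> float:
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for exemplar index X and binary decision Y."""
    joint = []
    for p, q in zip(prior, p_right):
        joint.extend([p * (1.0 - q), p * q])
    p_y1 = sum(p * q for p, q in zip(prior, p_right))
    return entropy(prior) + entropy([1.0 - p_y1, p_y1]) - entropy(joint)


prior = [0.25] * 4
balanced = split_information(prior, [0.0, 0.0, 1.0, 1.0])    # clean, even split
unbalanced = split_information(prior, [0.0, 0.0, 0.0, 1.0])  # clean, uneven split
noisy = split_information(prior, [0.1, 0.1, 0.9, 0.9])       # plane cuts the densities
print(balanced, unbalanced, noisy)
```

Only the clean, evenly balanced split reaches the full one bit per decision; uneven splits and splits that cut through the exemplar densities gain less.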

Figure 3: Depiction of the division of the 3-D decision space such that each half-space is
occupied by approximately one-half of the exemplar image points.


2.2.2  Significant  Attributes  of  the  ITDT  ATR 

The  Information  Theoretic  Decision  Tree  ATR  possesses  the  following  attributes: 

•  The number of decisions in the on-line ITDT process grows approximately logarithmically
with exemplar set size. With the 106 exemplar system to be demonstrated, only 7 decisions are
needed to establish the nearest exemplar to a given input image.

•  The off-line ITDT construction process is very computationally intensive. Computation
of a 106 target exemplar based system with the simplest additive i.i.d. statistical
description of sensor/target stochastic behavior requires 2 hours on a 200 MHz workstation.
This will increase linearly with exemplar size.

•  The  off-line  computation  involves  the  approximate  solution  of  a  global  optimization 
problem  involving  mutual  information  measures  computed  from  exemplar  statistics. 
As  in  any  attempt  at  global-optimization  with  bounded  computational  resources,  the 
solution is only a local optimum. The beauty of the ITDT approach is that the non-global
nature of the solution has no effect on the performance of the ATR; it only
potentially increases the number of decisions that must be executed. (In all of our
1-DOF  tests,  to  date,  we  have  not  seen  more  than  a  single  additional  decision  level  as 
a  result  of  this  non-optimality.) 

2.3  ITDT  Performance  Results 

We  have  constructed  a  prototype  that  demonstrates  the  new  approach  for  1-DOF  problems 
and have tested this system with the same SAR data sets used in the ARO and DARPA
sponsored project that has led to the development of two other ATR systems for variable
target pose recognition, LSD/DOA and PERFORM. As we will compare performance with
these two systems, a brief description of each follows:

2.3.1  LSD/DOA  ATR  architecture 

Developed by WPI under a grant from the ARO and DARPA, the LSD/DOA (Linear Signal
Decomposition  /  Direction  of  Arrival)  ATR  is  a  model-based  target  identification  and  pose 
estimation  algorithm  which  reduces  a  pose  dependent  object  description  into  a  small,  fixed, 
non-linear  filter  system  for  ATR  and  estimation  of  pose  parameters  for  targets  with  unknown 
pose  [7,  8]. 

The  technique  factors  the  ATR  problem  into  two  components:  an  off-line  analytic  model 
construction process and an on-line direct determination of object identity and pose. This architecture
allows fast, non-iterative object recognition by shifting the computational burden
of  ATR  to  the  off-line  filter  construction. 

While  for  classic  ATR  systems  the  computational  burden  lies  in  the  on-line  (real-time) 
processing,  involving  repeated  interrogation  of  the  object  model,  for  LSD/DOA  the  burden 
lies  in  the  off-line  (pre-mission)  reciprocal-basis-set  (RBS)  construction  process.  This  basis 
set  is  constructed  in  such  a  way  that  linear  projection  of  an  acquired  object  image  onto  the 
basis  images  yields  a  set  of  harmonically  related  multidimensional  complex  samples.  From 
these  samples  object  pose  parameters  may  be  estimated  using  direction-of-arrival  (DOA) 
techniques.  Thus,  LSD/DOA  provides  a  means  to  perform  ATR  with  any  object  type  with 
parameterizable  behavior:  opaque,  refractive,  rigid,  articulated,  etc. 

Furthermore,  the  decomposition  of  the  recognition  system  into  a  variable  size  system  of 
linear  filters  that  generate  the  data  for  non-linear  estimation,  allows  configuration  that  can 
balance  computational  complexity  and  data  storage  with  performance  requirements. 

2.3.2  PERFORM  ATR  architecture 

The Partial Evidence Reconstruction From Object Restricted Measures (PERFORM) algorithm
was derived from LSD/DOA for the purposes of demonstrating recognition of partially
obscured  targets  [9,  10].  Its  derivation  is  based  upon  the  observation  that  by  confining  an 
RBS  support  region  to  the  interior  of  the  object  support  intersection  over  the  full  pose 
range,  we  can  obtain  information  about  an  object  without  introduction  of  clutter  induced 
corruption  without  an  accompanying  loss  of  object  information  and  hence  discrimination 
power. 

By  using  several  such  target  embedded  estimators  we  can  develop  a  set  of  partial  object 
evidence  which  can  be  assembled  into  a  single  object  pose  and  identity  hypothesis.  A  direct 
benefit  is  greatly  increased  robustness  to  object  obscuration  in  so  much  as  certain  cover 
estimators  may  be  altogether  unaffected  by  partial  obscuration. 

2.3.3  Performance  of  the  ITDT  ATR 

In the prototype ITDT implementation, a 1369 dimensional space was populated by 106 exemplar
points with a distribution model based on the assumption of additive i.i.d. Gaussian
noise. A 7 level decision tree was constructed from this data.

The system was tested using SAR imagery. The target data was generated from spotlight
SAR  phase  history  files  provided  by  Wright  Laboratories,  Wright-Patterson  AFB.  The  test 
results  presented  below  are  based  on  data  from  a  Soviet  T72  tank.  The  target  exemplars 
were generated from L band data, a 10 degree elevation angle, and HH and VV polarization
data  which  were  used  to  form  a  single,  polarimetrically  whitened  image.  The  background 
data  was  taken  from  the  ADTS  data  set  obtained  from  Lincoln  Laboratories.  The  images 
used  were  polarimetrically  whitened  SAR  images  depicting  terrain  in  Stockbridge  NY  at  1 
ft.  by  1  ft.  resolution.  The  final  test  images  were  obtained  by  overlaying  the  targets  onto 
the  clutter  backgrounds  by  masking  out  a  region  of  the  clutter  corresponding  to  the  convex 
hull  of  the  brightest  target  pixels  and  inserting  the  target  image  into  the  masked  area. 

The  original  data  set  allowed  for  the  construction  of  318  poses  of  the  target  at  equi-spaced 
angles  from  0  to  360  degrees.  Throughout  the  following  discussion  we  used  only  a  subset 
of  this  set  corresponding  to  106  images  or  53  (every  other  image  of  the  106)  representing 
poses  from  pose  angle  0  to  120  degrees.  This  restricted  range  was  used  to  better  expose  the 
behavior  of  the  ITDT  as  a  pose  estimator  in  noise  without  the  apparent  large  errors  that 
are  induced  by  selection  of  estimates  opposed  to  the  correct  angle  by  180  degrees  due  to  the 
near  perfect  symmetry  of  the  target  at  certain  poses.  The  53  image  set  was  used  to  test  the 
effects  of  less  complete  training  sets. 

In Table 1 we show the performance of the ITDT system with respect to error of estimation
of azimuthal pose angles of the test targets under various conditions. The first entry
describes an evaluation which simply tests the sensitivity of the system to clutter backgrounds.
That is, the test images were constructed by placing the 106 training images on a
large number of clutter backgrounds. If the algorithm were totally insensitive to the clutter
background, the error would be zero in this case as no other perturbations are present. The
small 0.67 degree std. dev. reveals a very small perturbation of the estimate.

The  second  entry  in  the  table  describes  pose  estimation  performance  in  the  case  of  an 
ITDT constructed from the 53 image subset of target images and tested with the full 106 image
set placed on various clutter backgrounds. Furthermore, the test images were corrupted
by  the  addition  of  simulated  SAR  speckle  noise.  Again  the  performance  is  quite  good  with 
a  pose  estimate  std.  dev.  of  0.86  degrees. 

Finally  the  third  entry  tests  the  improvement  that  can  be  obtained  by  increasing  the  size 
of  the  training  set.  Here  the  ITDT  is  constructed  from  all  106  target  images  and  tested  with 
the  full  106  image  set  placed  on  various  clutter  backgrounds.  Again,  the  test  images  were 
corrupted  by  the  addition  of  simulated  SAR  speckle  noise.  Again  the  performance  is  quite 
good  with  a  pose  estimate  std.  dev.  of  0.66  degrees. 

Decision tree                 Test suite                 Error mean     Std. deviation
---------------------------------------------------------------------------------------
t72X 10_pwf.dB.156.fl06.DT    t72J106.onbg.suite         0.28 degrees   0.67 degrees
  106 image (0..120 deg.) DT tested with 106 unspeckled images (0..120 deg.) on clutter

t72 156.fl06e53.DT            t72.157.fl06.onbg.suite    0.76 degrees   0.86 degrees
  53 image (0..120 deg.) DT (2.3 deg. spacing) tested with 106 speckled images (0..120 deg.) on clutter

t72_L 10 Dwf.dB.156.fl06.DT   t72.157.fl06.onbg.suite    0.28 degrees   0.66 degrees
  106 image (0..120 deg.) DT (1.14 deg. spacing) tested with 106 speckled images (0..120 deg.) on clutter

Table 1: Performance of ITDT with respect to estimating the azimuthal pose angle of targets
under various conditions.

The prototype successfully demonstrates logarithmic processing time growth with increasing
exemplar set size and a performance that exceeds all of our previous ATR systems,
as shown in the following ROC curve. To compare the performance for a given amount of
processing, we need to make the following observations:

•  The number of floating point operations associated with the ITDT algorithm for each
image tested is approximately equal to 7N, where N is the number of pixels in the
region of interest and the constant, 7, obtains from the number of decisions made.

•  The number of floating point operations associated with the LSD/DOA algorithm for
each image tested is approximately equal to 15N, where N is the number of pixels in the
region of interest and the constant, 15, obtains from the number of linear filter operations
(4) and number of metric evaluations (11).

•  The number of floating point operations associated with the PERFORM algorithm for each
image tested is approximately equal to 15N, where N is the number of pixels in the
region of interest and the constant, 15, is given as for LSD/DOA.

Thus we see that at less than half the processing load, the ITDT achieves performance
dramatically in excess of the other methods at the lowest false alarm rates.
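
The processing-load comparison can be checked with a line of arithmetic. The sketch below assumes, purely for illustration, that the region of interest has N = 1369 pixels (the dimension of the image space used in the prototype); the ratio of the two loads does not depend on this choice. The function name is hypothetical.

```python
def flops_per_image(num_pixels: int, ops_per_pixel: int) -> int:
    """Approximate floating point operations per tested image."""
    return ops_per_pixel * num_pixels


N = 1369  # assumption: region of interest equal to the 1369-dimensional image space

itdt = flops_per_image(N, 7)          # 7 decisions, one pass over the pixels each
lsd_doa = flops_per_image(N, 4 + 11)  # 4 linear filters plus 11 metric evaluations

print(itdt, lsd_doa, itdt / lsd_doa)
```

The ratio 7/15 is about 0.47, consistent with the statement that the ITDT operates at less than half the processing load.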

Figure 4: ROC curves for the ITDT, LSD/DOA and PERFORM ATR systems (target: T72 tank;
backgrounds: MIT-LL clutter).

3  Derivation  of  Information  Theoretic  ATR  Performance  Bounds

We develop in this section a set of ATR performance bounds based upon the application
of information theory to decision problems. After discussing the relevance of information
theoretic bounds to the ATR problem, and reviewing certain well known results, an example
application  is  examined.  The  example  demonstrates  that  the  Blahut  bound  is  extremely  loose 
for  even  a  simple  problem.  This  leads  to  the  derivation  of  a  new  tighter  bound  based  on 
Fano’s  inequality  which  we  call  the  Fano-Envelope  bound. 


3.1  Approach 

We consider the problem of correctly recognizing a single target from a finite set of possibilities.
Define a discrete random variable, $X$, with alphabet $\mathcal{X} = \{c_1, c_2, c_3, \ldots, c_N\}$,
corresponding to the set of all target classes under consideration, and let $p_X(c_i)$ be the probability
mass function of $X$. An ATR makes an observation $Y$, which is another discrete random
variable with alphabet $\mathcal{Y} = \{y_1, y_2, y_3, \ldots, y_N\}$, and described by the conditional probabilities
$p_Y(y_i \mid c_j)$. Finally, from $Y$ an estimate of $X$, $\hat{X} = f(Y)$, is made. This simplest form of the
target recognition process is a random mapping from the set of possible targets onto itself,
and is shown in Figure 5.

Figure 5: An N-ary target recognition problem with input alphabet equal to the set of
possible targets and output alphabet equal to the set of decisions made by the ATR.

The receiver operating characteristic (ROC) diagram plots the probability of detection
against  the  probability  of  false  alarm  for  binary  hypothesis  testing  problems.  The  curve  is 
generated  by  parametrically  varying  the  threshold  applied  to  the  decision  metric,  which  is 
the  only  statistic  available  to  distinguish  between  the  two  hypotheses. 
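
The threshold sweep that traces out a ROC curve can be sketched with a simple stand-in problem. The example below is not the ATR decision metric of this report; it assumes a scalar metric that is Gaussian with unit variance under both hypotheses and a mean shift `separation` when the target is present. All names are hypothetical.

```python
import math
from typing import Iterable, List, Tuple


def q_function(x: float) -> float:
    """Upper tail probability of a standard normal random variable."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def roc_points(separation: float, thresholds: Iterable[float]) -> List[Tuple[float, float]]:
    """(P_fa, P_d) pairs for N(0,1) versus N(separation,1), deciding 'target' if metric > t."""
    return [(q_function(t), q_function(t - separation)) for t in thresholds]


thresholds = [x / 2.0 for x in range(-6, 13)]  # sweep t from -3.0 to 6.0
curve = roc_points(2.0, thresholds)
for p_fa, p_d in curve:
    print(f"{p_fa:.4f} {p_d:.4f}")
```

Each threshold yields one operating point; raising the threshold lowers both the false alarm probability and the detection probability, which is the tradeoff the ROC diagram displays.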

In the binary hypothesis testing problem, $\mathcal{X} = \{c_0, c_1\}$ and we denote the probability of
making a Type I error (false alarm, or false positive) as $P_{10} = p(y_1 \mid c_0)$ and the probability
of making a Type II error (missed detection, or false negative) as $P_{01} = p(y_0 \mid c_1)$. For
notational convenience, we let $\pi_0 = p_X(c_0)$ and $\pi_1 = p_X(c_1)$, and recall that $\pi_1 = 1 - \pi_0$.

We  will  now  address  the  question  of  how  to  combine  information  theoretic  concepts  and 
translate them into a form compatible with the ROC diagram evaluation of ATR performance.
In particular, we will examine Fano’s inequality from information theory, generate
a  performance  bound  on  a  ROC  diagram  for  the  binary  hypothesis  testing  problem,  and 
compare  it  to  another  ROC  performance  bound  derived  from  information  theoretic  concepts 
by  Blahut  [2]. 

3.2  Fano’s  Inequality 

Suppose we know a random variable $Y$ and wish to estimate the value of a correlated random
variable $X$. Fano’s inequality relates the probability of error in estimating the random
variable $X$ to its conditional entropy $H(X \mid Y)$.

Theorem. Fano’s Inequality [1]. Let $X, Y$ be random variables with respective alphabets
$\mathcal{X}$ and $\mathcal{Y}$. Let $Y$ be related to $X$ through the conditional probabilities $p(y \mid x)$. From $Y$, an
estimate of $X$, $\hat{X} = f(Y)$, is produced. Define $E : \mathcal{X} \times \mathcal{X} \to \{0, 1\}$ to be a binary error
random variable such that

$$E(x, f(y)) = \begin{cases} 0, & f(y) = x \\ 1, & f(y) \neq x \end{cases}$$

with probability $P_e = \Pr(E = 1)$ and entropy

$$H(E) = H(P_e) = -P_e \log_2(P_e) - (1 - P_e) \log_2(1 - P_e). \tag{1}$$

It is the case then that

$$P_e \log_2(|\mathcal{X}| - 1) + H(P_e) \geq H(X \mid Y), \tag{2}$$

where $|\mathcal{X}|$ is the number of elements in the alphabet of the random variable $X$, and the
conditional entropy is defined to be

$$H(X \mid Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log_2 p(x \mid y). \tag{3}$$

In the binary hypothesis testing case, $|\mathcal{X}| = 2$, so Fano’s inequality reduces to

$$H(P_e) = -P_e \log_2(P_e) - (1 - P_e) \log_2(1 - P_e) \geq H(X \mid Y). \tag{4}$$
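
The binary form of the inequality can be checked numerically. The following is a minimal sketch, assuming equal priors and two hypothetical four-point likelihoods (not the distributions used later in this report), with the maximum a posteriori rule standing in for the estimator $f$:

```python
import math


def binary_entropy(p):
    """H(p) in bits, with the convention 0 * log2(0) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def conditional_entropy(priors, likelihoods):
    """H(X|Y) in bits for p(x, y) = priors[x] * likelihoods[x][y]."""
    h = 0.0
    for y in range(len(likelihoods[0])):
        p_y = sum(priors[x] * likelihoods[x][y] for x in range(len(priors)))
        for x in range(len(priors)):
            p_xy = priors[x] * likelihoods[x][y]
            if p_xy > 0:
                h -= p_xy * math.log2(p_xy / p_y)   # p(x|y) = p(x,y) / p(y)
    return h


def map_error(priors, likelihoods):
    """Error probability of the maximum a posteriori estimate of X from Y."""
    correct = sum(
        max(priors[x] * likelihoods[x][y] for x in range(len(priors)))
        for y in range(len(likelihoods[0]))
    )
    return 1.0 - correct


# Hypothetical example: equal priors, four observation values.
priors = [0.5, 0.5]
likelihoods = [[0.5, 0.3, 0.15, 0.05], [0.05, 0.15, 0.3, 0.5]]
h_x_given_y = conditional_entropy(priors, likelihoods)
p_e = map_error(priors, likelihoods)
fano_holds = binary_entropy(p_e) >= h_x_given_y
```

For this example the MAP error is 0.2, and the entropy of the error exceeds the conditional entropy, as the inequality requires of any estimator.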

3.3  Blahut’s  Bound 

The Blahut bound on the probabilities of Type II and Type I errors, $P_{01}$ and $P_{10}$, is built on
the relative entropy (discrimination, Kullback-Leibler distance) for a pair of discrete random
variables $X$ and $Y$ with respective distributions $p(x_k)$ and $p(y_k)$ [2]. Specifically, if $P_{10} \leq e^{-nr}$, then

$$P_{01} \geq 1 - \frac{4\sigma^2}{n(r - c)^2} - e^{-n(r - c)/2}, \tag{5}$$

where

$$\sigma^2 = \sum_k p(y_k) \left( \log \frac{p(y_k)}{p(x_k)} \right)^2 - \left( \sum_k p(y_k) \log \frac{p(y_k)}{p(x_k)} \right)^2, \tag{6}$$

$$c = \sum_k p(y_k) \log \frac{p(y_k)}{p(x_k)} \tag{7}$$

is the relative entropy, $r > c$ is the relative entropy between a dummy distribution and
$p(x_k)$, which is varied to produce samples for the bound, and $n$ is the number of independent
measurements. Blahut’s bound has its roots in the block coding converse to the coding
theorem for noisy channels in information theory [6]. There $r$ corresponds to the transmission
rate of a source, $c$ to the capacity of a discrete memoryless channel, and $n$ to the block length.

3.4  Performance  Bounds  on  a  ROC  curve

A ROC curve plots the probability of detecting a target, $1 - P_{01}$, against the probability of
false alarm, $P_{10}$. The total error probability is related to these conditional probabilities by

$$P_e = P_{01}\pi_1 + P_{10}\pi_0. \tag{8}$$

This equation can be rewritten as

$$P_{01} = -P_{10}\frac{\pi_0}{\pi_1} + \frac{P_e}{\pi_1}, \tag{9}$$

a relation between $P_{01}$ and $P_{10}$ corresponding to a curve on a ROC diagram.

It is apparent from this expression that for given prior probabilities $(\pi_0, \pi_1)$, constant
values of $P_e$ correspond to straight lines on a ROC diagram as shown in Figure 6. It then
follows that the straight line defined by the minimum value of the probability of error,
$P_e \geq P_e^{\min}$, partitions the ROC diagram into two regions corresponding to achievable and
unachievable performance. The triangular region in the ROC diagram inaccessible to ATR
performance can be written as

$$\mathcal{R}_{\pi_0,\pi_1} = \left\{ (P_{10}, P_{01}) \;\middle|\; P_{01} < -P_{10}\frac{\pi_0}{\pi_1} + \frac{P_e^{\min}}{\pi_1} \right\} \tag{10}$$

for given prior probabilities $(\pi_0, \pi_1)$, and is illustrated in Figure 6.
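
The partition can be expressed directly in code. This is a small sketch of the constant-$P_e$ line and the membership test for the inaccessible region; the function names and the example values are ours, chosen for illustration only.

```python
def p01_on_iso_error_line(p10, p_e, pi0):
    """P01 on the constant-P_e line for prior pi0 = Pr(c0), pi1 = 1 - pi0."""
    pi1 = 1.0 - pi0
    return (p_e - pi0 * p10) / pi1


def is_inaccessible(p10, p01, p_e_min, pi0):
    """True if (P10, P01) lies strictly below the P_e = p_e_min line."""
    return p01 < p01_on_iso_error_line(p10, p_e_min, pi0)


# Hypothetical values: equal priors and a minimum error probability of 0.2.
below_line = is_inaccessible(0.05, 0.05, 0.2, 0.5)   # total error 0.05 < 0.2
above_line = is_inaccessible(0.30, 0.30, 0.2, 0.5)   # total error 0.30 > 0.2
```

A point is flagged as inaccessible exactly when its total error probability $P_{01}\pi_1 + P_{10}\pi_0$ falls below the assumed minimum.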

If the prior probabilities $(\pi_0, \pi_1)$ are varied parametrically, both the slope and the
intercept of the line in Equation 9 change, producing a new region of inaccessibility. The
set of all such regions defines a family of bounds on ATR performance.

A family of bounds can be generated from Fano’s inequality by parametrically varying
the values of $(\pi_0, \pi_1)$ in Equation 10. For clarity, we examine these bounds in the context of a
specific binary hypothesis testing problem. We start by defining the conditional probabilities

$$\Pr(y \mid c_0) = p_0(y), \tag{11}$$

$$\Pr(y \mid c_1) = p_1(y), \tag{12}$$

each a distribution over four observation values [numerical entries illegible in the source; only the denominators 64 and 256 of $p_0$ survive].

The Blahut bound is computed from Equation 5 by treating $r$ as an independent variable
and computing $P_{01}$. For this problem $c = 5.165$ and $\sigma^2 = 1.708$. This calculation yields only
a portion of the bounding curve due to the restriction that $r > c$. To construct the other
portion, the roles of the error probabilities $P_{01}$ and $P_{10}$ are reversed, and the computation
repeated.
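
The sweep over $r$ can be sketched as follows. The sketch assumes that the bound has the form $P_{01} \geq 1 - 4\sigma^2 / (n(r-c)^2) - e^{-n(r-c)/2}$ whenever $P_{10} \leq e^{-nr}$, and it assumes a single measurement, $n = 1$, since the value of $n$ used for the figures is not stated. Only $c$ and $\sigma^2$ are taken from the text.

```python
import math


def blahut_lower_bound(r, c, sigma2, n=1):
    """Assumed form of the bound: floor on P01 when P10 <= exp(-n * r)."""
    if r <= c:
        raise ValueError("the bound requires r > c")
    gap = r - c
    bound = 1.0 - 4.0 * sigma2 / (n * gap * gap) - math.exp(-n * gap / 2.0)
    return max(0.0, bound)      # a negative value carries no information


c, sigma2 = 5.165, 1.708        # values quoted in the text
# Sweep r to trace (P10 ceiling, P01 floor) pairs on one branch of the curve.
branch = [(math.exp(-r), blahut_lower_bound(r, c, sigma2))
          for r in (6.0, 8.0, 10.0, 12.0)]
```

Under these assumptions the bound is vacuous for $r$ close to $c$ and becomes informative only once $r - c$ is large compared with $\sigma$.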

We illustrate such a family of bounds for the specific hypothesis testing problem introduced
above. Figure 7 shows a family of bounds generated from multiple applications of
Fano’s inequality and Equation 10. For comparison, the Neyman-Pearson ROC curve and
Blahut’s bound are also presented.

Equation 10 introduced the concept of a region of inaccessibility for the performance of
an ATR system. If we consider the union of the regions of inaccessibility over $(\pi_0, \pi_1)$,

$$\mathcal{R} = \bigcup_{(\pi_0, \pi_1)} \mathcal{R}_{\pi_0,\pi_1}, \tag{13}$$

then it is evident that $\mathcal{R}$ itself defines a bound on ATR performance over all values of
$(\pi_0, \pi_1)$. This bound is readily seen to be described by the envelope formed by the sequence
of lines generated by the family of Fano-derived bounds. Hence, we call this new bound the
Fano-Envelope bound.

Figure 6: ROC diagram partitioned by a lower bound on the total probability of error $P_e$.
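
The envelope can be approximated numerically by evaluating the Fano line on a grid of priors and keeping, at each $P_{10}$, the largest resulting lower bound on $P_{01}$. The sketch below assumes the binary form of Fano’s inequality, inverts the binary entropy by bisection on $[0, 1/2]$, and uses a uniform grid of priors; the grid size and the distributions passed in are illustrative choices, not values from this report.

```python
import math


def binary_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def conditional_entropy(pi0, p0, p1):
    """H(X|Y) in bits for priors (pi0, 1 - pi0) and likelihoods p0, p1."""
    h = 0.0
    for a, b in zip(p0, p1):
        joint = (pi0 * a, (1 - pi0) * b)
        p_y = joint[0] + joint[1]
        for p_xy in joint:
            if p_xy > 0:
                h -= p_xy * math.log2(p_xy / p_y)
    return h


def fano_min_error(h_cond, tol=1e-12):
    """Smallest P_e in [0, 1/2] with H(P_e) >= h_cond, found by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_entropy(mid) >= h_cond:
            hi = mid
        else:
            lo = mid
    return hi


def fano_envelope(p10, p0, p1, n_priors=99):
    """Largest Fano lower bound on P01 at a given P10 over a grid of priors."""
    best = 0.0
    for i in range(1, n_priors + 1):
        pi0 = i / (n_priors + 1)
        p_e_min = fano_min_error(conditional_entropy(pi0, p0, p1))
        best = max(best, (p_e_min - pi0 * p10) / (1 - pi0))
    return best
```

Calling `fano_envelope` over a range of `p10` values traces the boundary of the union of inaccessible regions; a finer grid of priors tightens the approximation of the envelope from below.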

3.5  Discussion 

We  have  developed  a  family  of  bounds  on  ATR  performance  based  on  Fano’s  inequality  from 
information  theory  and  expressed  in  terms  of  the  ROC  diagram  from  hypothesis  testing. 
We were able to demonstrate that the associated Fano-Envelope bound, generated by
the envelope of the family of bounds we obtained, is much tighter relative to the
Neyman-Pearson ROC curve than Blahut’s bound.

From a pragmatic viewpoint, however, the requirements for computation of the bounds
derived  from  Fano’s  inequality  equal  those  for  directly  generating  the  exact  ROC  curve  of  a 
given  ATR  system. 

Figure 7: A family of bounds obtained from Fano’s inequality. The Neyman-Pearson ROC
curve and Blahut’s bound are shown for comparison.

4  The  Information  Theoretic  Decision  Tree  ATR  System

4.1  Problem  and  Approach  Description 

4.1.1  Automatic  Target  Recognition  (ATR)  problem 

The  problem  we  are  trying  to  solve  in  the  following  formulation  of  an  ATR  architecture  can 
be  stated  as  follows. 

Given  a  set  of  images  which  represent  some  object  at  different  poses  (that 
is,  a  set  of  exemplars),  construct  a  system  capable  of  performing  the  following 
recognition  task: 

For  a  given  image,  perhaps  containing  a  target  related  to  the  training  suite 
of  images  by  corruption  with  noise  or  obscuration  and  embedded  on  background 
clutter,  estimate  the  pose  of  the  most  likely  target  and  return  a  metric  which 
describes  (in  some  sense)  the  strength  of  confidence  of  the  presence  of  the  target 
at  that  pose  in  the  image. 

The brute-force approach to solving this problem would involve comparing the
given test image to each image in the training set (the target suite database) and making
a decision based on the value of a similarity metric.
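
As a point of reference for the costs discussed next, a brute-force matcher can be sketched in a few lines. Squared Euclidean distance is used here as the similarity metric purely for illustration; the report does not specify the metric.

```python
def brute_force_match(test_image, exemplars):
    """Return (index, distance) of the exemplar closest to test_image.

    Images are flat sequences of pixel values. Squared Euclidean distance
    is an assumed similarity metric.
    """
    best_index, best_dist = -1, float("inf")
    for k, exemplar in enumerate(exemplars):     # K comparisons, N ops each
        dist = sum((a - b) ** 2 for a, b in zip(test_image, exemplar))
        if dist < best_dist:
            best_index, best_dist = k, dist
    return best_index, best_dist


# Hypothetical two-pixel "images".
index, dist = brute_force_match([0.9, 1.2], [[0, 0], [1, 1], [4, 4]])
```

Every query touches all $K$ exemplars, which is the cost the decision tree approach is meant to avoid.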

However, ATR systems must make a decision in a short period of time. The database
for a realistic training suite could contain hundreds of thousands (or even billions) of
high-resolution images owing to the several degrees of freedom in target pose, illumination,
historical excitation, and articulation; thus brute-force methods can be unusably slow.

4.1.2  Decision  Tree  Approach 

The  basis  for  a  Decision  Tree  approach  involves  creation  of  a  set  of  decision  rules  for  a  given 
database  and  replacement  of  brute-force  comparisons  between  target  image  and  database 
images  with  testing  of  the  target  image  against  a  contextually  selected  sequence  of  these 
decision  rules.  The  goal  is  to  make  the  operation  of  testing  against  these  decision  rules  fast 
and  to  keep  the  total  number  of  decision  rules  tested,  to  obtain  a  final  decision,  small. 

Although  the  price  for  faster  on-line  processing  will  be  time  needed  to  construct  a  set  of 
decision  rules  for  a  given  set  of  database  images,  this  can  be  done  off-line,  i.e.,  prior  to  the 
recognition  process. 

4.1.3  A  Linear  Decision  Tree  Architecture 

Let us assume that there are $K$ images in the training image database and that each is $n_1 \times n_2$
pixels in extent. We can then think of the suite of images as $K$ vectors or points in the
$N$-dimensional space that contains all possible images, where $N = n_1 \times n_2$.

As is well known, the problem of target recognition now corresponds to the division of this
space into regions which describe equivalence classes. In the simplest, binary hypothesis case,
we would assign membership of every point in the space to one of two equivalence classes:
that of $H_0$, which represents the no-target-present hypothesis, and that of $H_1$, which
represents the target-present hypothesis.

Generally, we would want to divide the space further into equivalence classes representing
each possible target (the multi-hypothesis problem) or representing values of a parameter
associated with the target (the composite hypothesis problem). In the sequel we will treat the case of
a single target type, with a continuous parameter represented as a set of discrete values (for
which exemplars of the resulting target image are available in our training suite). Furthermore,
we shall concern ourselves with the division of the space into equivalence classes only
with respect to the parameter values. That is, there will be no equivalence class associated
with the $H_0$ hypothesis.

What would appear to be an odd omission, namely not partitioning the image space for
the purpose of target detection, makes more sense when considering the
operation of the classic likelihood-ratio-based Neyman-Pearson composite hypothesis test.
In  effect,  we  wish  to  maximize  the  likelihood  ratio  with  respect  to  all  possible  parameter 
values.  Then,  a  separate  decision  may  be  applied  to  the  result  regarding  the  target-presence 
hypothesis.  Likewise,  we  shall  construct  our  equivalence  class  division  so  as  to  derive  the 
match  metric  between  the  given  image  and  the  most  likely  parametric  representation.  This 
final  metric  may  then  be  judged  on  the  basis  of  a  contextual  threshold  value  determined  by  a 
constraint  such  as  probability  of  false  alarm  versus  probability  of  detection.  Incorporation  of 
the  final  detection  into  the  equivalence  class  construction  would  yield  a  complicated  variation 
of  class  boundaries  based  on  the  constraints  and  associated  thresholds  rather  than  the  fixed 
boundaries  yielded  by  the  current  approach. 

4.2  Hyper-plane  Based  Approximation 

The  construction  of  the  equivalence  class  boundaries  described  above  proceeds  directly  from 
the  association  of  every  point  in  the  N-D  image  space  with  the  exemplar  that  maximizes  the 
a  posteriori  probability  function.  However,  in  general,  these  boundaries  are  surfaces  with 
arbitrary  curvature.  That  is,  the  expense  of  storing  representations  of  these  boundaries  and 
of  testing  equivalence  set  membership  of  a  point  with  respect  to  these  boundaries  is  immense. 
In  fact,  exact  representation  in  the  most  general  case  requires  infinite  data  storage  capacity 
since  the  boundaries  are  generally  not  describable  by  less  than  an  infinite  number  of  power 
series  expansion  coefficients.  Hence,  we  need  to  consider  from  the  onset  the  approximate 
representation  of  the  decision  boundaries. 

One approximate representation has particular appeal and will be the basis for the remaining
development. Any continuous surface can be approximated to any given degree by a
sufficiently large set of planar facets. Furthermore, testing a point for inclusion inside a
space defined by a set of $L$ planes involves only computation of the inner products of the
point vector representation with respect to the $L$ normals of the planes. Thus, the degree of
approximation  and  computational  complexity  of  the  decision  are  both  simply  linked  to  the 
number  of  planes  chosen  for  the  representation. 
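
The inclusion test reduces to $L$ inner products and sign checks, as the following sketch shows. The convention that each plane is stored as a pair (normal `w`, offset `b`) with the normal pointing into the region is our assumption for the example.

```python
def inside_polytope(point, planes):
    """True if w . point + b >= 0 for every (w, b) pair in planes."""
    for w, b in planes:
        if sum(wi * xi for wi, xi in zip(w, point)) + b < 0:
            return False
    return True


# Hypothetical example: the unit square in two dimensions, L = 4 planes.
unit_square = [([1, 0], 0), ([-1, 0], 1), ([0, 1], 0), ([0, -1], 1)]
inside = inside_polytope([0.5, 0.5], unit_square)
outside = inside_polytope([1.5, 0.5], unit_square)
```

The cost of one test is $L$ inner products of length $N$, so both accuracy and cost scale with the number of planes.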

Thus,  building  a  set  of  decision  rules  can  be  associated  with  construction  of  hyper-planes 
which  partition  N-dimensional  space  into  hyper-plane  bounded  domains,  or  polytopes,  each 
containing only one exemplar database point. The minimal such construction, that is, one
in which this segregation of each exemplar into a unique polytope is achieved with the
smallest number of decision planes, will be the subject of the next phase of this development.
As will be seen, we will have to moderate the minimalist nature of the construction to
obtain a useful approximation of the original equivalence region. However, the construction
of significantly better approximations of the true equivalence class region based directly on
geometric approximation of the true decision region is considerably more complicated and
can be circumvented to some extent, as will be discussed later.

4.2.1  Hyper-plane  Decision  Tree  System 

Construction  of  a  minimal  hyper-plane  based  division  of  our  exemplars  would  appear  to 
be  related  to  a  binary  tree  representation  of  the  exemplar  set.  That  is,  considering  each 
hyper-plane  as  a  means  to  divide  a  set  of  points,  the  smallest  number  required  to  divide  that 
set  would  be  the  number  of  nodes  in  a  balanced  binary  tree  which  represents  consecutive 
divisions of the exemplar set. Thus given $K$ exemplars we would require on this basis no
more than a binary tree with depth $\log_2 K$ and $K - 1$ associated decision planes. While the
number of decision planes is large, the number of tests that must be conducted to ascertain
the equivalence class membership is small, that being $\log_2 K$.

As we discuss the construction of the hyper-planes that represent the decisions at the
nodes of this binary tree, we will see that for our problem, $\log_2 K$ decisions is a lower bound
and not an entirely appropriate representation given the need to preserve performance within
this approximate decision boundary representation.

The  above  statement  can  be  motivated  with  the  following  observation.  Consider  an 
arrangement  of  image  points  in  which  division  of  the  set  into  two  equal  halves  (a  prescription 
for  the  first  node  in  a  balanced  binary  tree)  would  require  that  the  decision  boundary  pass 
immediately  alongside  a  particular  point.  It  should  be  obvious  that  attaining  a  minimal 
representation  in  this  case  will  also  lead  to  an  incorrect  association  of  that  point  with  the 
nearest  neighboring  equivalence  class  nearly  half  the  time  given  even  a  small  noise  induced 
distribution  of  image  points  about  the  exemplar  point.  Our  minimization  of  representation 
must  be  tempered  by  issues  related  to  performance  in  the  context  of  stochastic  perturbation 
of  image  points  from  the  exemplar  locations. 

In  the  following  we  will  derive  an  appropriate  basis  for  the  division  of  our  exemplars 
by  hyper-planes  through  an  application  of  an  information  theoretic  measure  which  does  not 
result  in  a  balanced  binary  tree. 

The first-level hyper-plane in a decision-tree-based scheme thus should, it would seem,
divide all points into two roughly, but not necessarily exactly, equal subsets. The second-level
hyper-planes divide the two first-level subsets roughly in half again, and so on until
all points are separated from each other or, in other words, until each point is boxed in by a few
hyper-planes (Fig. 8).

Once such a system is constructed, the recognition process becomes nothing but a sequence
of decision steps, where at each step the target image is tested to determine on which
side of the hyper-plane associated with the given node in the decision tree it resides. At
each step the number of possible equivalence classes that may be associated with the image
under test is reduced approximately by half, until we reach a polytope containing only
one exemplar point; this exemplar is then declared the closest exemplar to the test image.

The system of hyper-planes has a binary tree structure and therefore the on-line complexity
of forming the complete decision may be expected to be of the order $\log_2 K$ operations,
as opposed to $K$ operations for the brute-force comparison approach.

Figure 8: Decision Tree Construction
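
The on-line recognition step can be sketched as a descent through a binary tree of hyper-planes. The `Node` layout and the one-dimensional example are hypothetical and serve only to show that a balanced tree over $K = 4$ exemplars needs $\log_2 4 = 2$ plane tests per query.

```python
class Node:
    """Internal node holds a hyper-plane (w, b); a leaf holds an exemplar index."""

    def __init__(self, w=None, b=0.0, neg=None, pos=None, exemplar=None):
        self.w, self.b = w, b
        self.neg, self.pos = neg, pos
        self.exemplar = exemplar


def classify(root, image):
    """Descend the tree; return (exemplar index, number of plane tests)."""
    node, tests = root, 0
    while node.exemplar is None:
        side = sum(wi * xi for wi, xi in zip(node.w, image)) + node.b
        node = node.pos if side >= 0 else node.neg
        tests += 1
    return node.exemplar, tests


# Hypothetical 1-D "images" with exemplars at 0, 1, 2 and 3.
leaves = [Node(exemplar=k) for k in range(4)]
root = Node(w=[1.0], b=-1.5,
            neg=Node(w=[1.0], b=-0.5, neg=leaves[0], pos=leaves[1]),
            pos=Node(w=[1.0], b=-2.5, neg=leaves[2], pos=leaves[3]))
result = classify(root, [2.2])
```

Each test is a single inner product, so the query cost grows with the depth of the tree rather than with the number of exemplars.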


4.2.2  The  Information  Theoretic  Decision  Tree 

The simple assumption behind the approach outlined above is that in the high-dimensional
image space, the boundaries of the decision region polytopes containing single points from
the exemplar database are sufficiently far from the exemplar points to allow
the decision process to tolerate reasonable test image perturbation without causing a choice
of a neighboring class. In other words, we assume that given a target image derived from an
exemplar image, say, $R = p_k + n$, the noise $n$ will not shift the test image point $R$ into
a polytope associated with an adjacent exemplar. Since this assumption need not hold, segregation
of the exemplar points into individual polytopes cannot be the only basis for the placement of decision planes.
Rather,  we  must  consider  the  probability  densities  that  are  associated  with  the  a  priori 
statistics  of  the  image  generation  process  as  these  determine  the  appropriate  placement  of 
decision  planes  between  adjacent  exemplar  points. 

In the following sections we will develop a means for determining the placement of the
decision planes and, implicitly, the form of the binary tree, so that a hierarchical decision
process results. While other coarse-to-fine hierarchical decision processes have been proposed
in the past, the approach to be described more fully below has the following special and
attractive attributes:

•  An information theoretic measure is used to drive the process of dividing the decision
space. This optimizes the amount of “information” yielded by each decision made in
the final process. If the imagery divides “cleanly” then this procedure guarantees the
decision tree has the well known optimum number of decisions associated with that
number of image points. Otherwise, the decision process has the smallest number
of decisions allowed by the data and consistent with minimal loss of discriminatory
information at each step.

•  The original image space is the setting for the decision process. Thus no artificial
reductions in dimensionality are introduced and all relevant image information is preserved
at each step in the process.

•  The solution for the optimum decision tree is intractable; however, the penalty for
using a sub-optimum solution is only additional decisions (hence processing time) and
not any loss of ultimate estimation and target recognition performance.

4.3  Information  Theoretic  Formulation 

In  this  section  the  mathematical  background  for  the  previously  described  concepts  will  be 
presented. Suppose now that for a given set of database points we have already placed a
trial first-level hyper-plane. Figure 9 describes the current decision process and
associated variables.

Figure 9: First-level decision sub-system

For a given test image $R$, which is assumed to be derived from one of the exemplar images
$X = p_k$ corrupted with noise $n$, a decision, $Y$, is produced, which indicates on which side
of the hyper-plane the point $R$ is located, and hence associates $R$ with one of two groups of
equivalence classes.

To  determine  the  appropriate  placement  of  the  hyper-plane,  consider  a  decision  Y  for 
some  given  hyper-plane  placement:  how  much  information  I  about  the  random  hypothesis 
process  X  can  we  obtain  based  on  observation  of  Y ? 

An appropriate definition of $I$ in a decision theoretic context is the mutual information
between $Y$ and the random process $X$. Given this definition it is at once obvious that

$$0 \leq I = I(\{p_k\}; Y) \leq 1 \tag{14}$$

since  we  can  derive  no  more  than  one  bit  of  information  from  one  binary  decision.  Obviously, 
we  would  want  to  derive  as  much  information  from  such  a  decision  as  is  possible.  Thus,  to 
make  a  well  informed  final  decision  from  small  numbers  of  intermediate  decisions,  we  are 
motivated  to  solve  for  hyper-plane  placement  on  the  basis  of  maximizing  the  information 
derived  from  each  intermediate  decision. 

It is important to note that we are not engineering an improvement in ATR performance
over that available with a Bayesian or Neyman-Pearson construction. In fact,
there will in general be a loss of performance with respect to these provably optimal
approaches. However, this development focuses on the minimization of computational effort
(the decisions) while retaining good performance. This minimization is achieved by fixing
a computational architecture and applying information theoretic optimization to minimize
the effort required to derive information from the data.

4.3.1  Mutual  Information  Function 

Now we must determine the best hyper-plane position so that the information measure, $I$,
is maximized for a given set of exemplar points. In order to solve this problem we need to
express $I$ in terms of hyper-plane characteristics and the set $\{p_i\}$.

Let us express $I$ in terms of entropy [1]:

$$I(\{p_i\}; Y) = H(\{p_i\}) + H(Y) - H(\{p_i\}, Y), \tag{15}$$

where

$$H(x) = -\sum_{x_i} \Pr(x = x_i) \log_2 \Pr(x = x_i). \tag{16}$$

Thus

$$H(Y) = -\sum_{y_j} \Pr(Y = y_j) \log_2 \Pr(Y = y_j), \tag{17}$$

$$H(\{p_i\}) = -\sum_{p_i} \Pr(p_i) \log_2 \Pr(p_i). \tag{18}$$

Note that $H(\{p_i\})$ is just a constant for a given probability distribution associated with the
exemplars. The joint entropy is

$$H(\{p_i\}, Y) = -\sum_{p_i} \sum_{y_j} \Pr(p_i, Y = y_j) \log_2 \Pr(p_i, Y = y_j), \tag{19}$$

$$\Pr(p_i, Y = y_j) = \Pr(p_i) \int_{Y = y_j} p(r \mid p_i)\, dr, \tag{20}$$

where

$\Pr(p_i)$ is the probability of the occurrence of $p_i$,

$p(r \mid p_i)$ is the probability density function of the target image $r$ given pose $p_i$, and

$\Pr(Y = y_j) = \sum_i \Pr(Y = y_j \mid p_i) \Pr(p_i)$.
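
To make these expressions concrete, the following sketch evaluates the mutual information of a single hyper-plane decision. It assumes isotropic Gaussian noise of standard deviation `sigma` about each exemplar, so that the integral over one side of the plane reduces to a Gaussian tail probability of the signed distance to the plane; the report does not state its noise model, and the example values are hypothetical.

```python
import math


def side_probability(w, w0, p, sigma):
    """Pr(Y = 1 | exemplar p): chance that p + noise lands on the positive side.

    Assumes isotropic Gaussian noise with standard deviation sigma.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    margin = (sum(wi * xi for wi, xi in zip(w, p)) + w0) / norm
    return 0.5 * (1.0 + math.erf(margin / (sigma * math.sqrt(2.0))))


def entropy(probs):
    """Entropy in bits of a list of probabilities (zero terms are skipped)."""
    return -sum(q * math.log2(q) for q in probs if q > 0)


def mutual_information(w, w0, exemplars, priors, sigma):
    """I({p_i}; Y) = H({p_i}) + H(Y) - H({p_i}, Y), in bits."""
    joint = []
    for p, prior in zip(exemplars, priors):
        q = side_probability(w, w0, p, sigma)
        joint += [prior * q, prior * (1.0 - q)]   # entries for Y = 1, Y = 0
    p_y1 = sum(joint[0::2])
    return entropy(priors) + entropy([p_y1, 1.0 - p_y1]) - entropy(joint)


# Hypothetical 1-D exemplars at 0, 1, 2, 3 with equal priors.
exemplars = [[0.0], [1.0], [2.0], [3.0]]
priors = [0.25] * 4
i_mid = mutual_information([1.0], -1.5, exemplars, priors, 0.1)   # plane at 1.5
i_near = mutual_information([1.0], -1.0, exemplars, priors, 0.1)  # through exemplar 1
```

A plane midway between exemplars yields nearly one bit, while a plane passing through an exemplar yields about 0.70 bits, which illustrates why the placement must account for the noise distribution and not only for the exemplar locations.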

Now, any hyper-plane in $N$-dimensional space can be uniquely defined by $(N+1)$ coefficients.
(For example, the equation of a hyper-plane in 2 dimensions, a line, is given by
$ay + bx + c = 0$, with coefficients $\{a, b, c\}$.) We will describe the hyper-plane by coefficients $w_0, \ldots, w_N$.

Now, given hyper-plane coefficients $w_0, \ldots, w_N$ we can uniquely compute $Y$ for a given point
$r$ in the $N$-dimensional space; thus we can express $I$ in terms of $w_0, \ldots, w_N$ and the set $\{p_i\}$.

The maximization of $I$ with respect to $w_0, \ldots, w_N$ is a highly non-linear global optimization
problem for which possibly many suboptimal solutions can be found. In the current
implementation of this work this optimization is performed by a multidimensional conjugate
gradient method [11].

4.3.2  Decision  Tree  Construction

Given a procedure that places a hyper-plane to divide a given set of points so as to maximize
the information gained by the decision, the decision tree construction process becomes
nothing but recursive application of that procedure to each subset of points derived by
each such division. The hyper-plane coefficients are then stored and associated with nodes
of the decision tree.
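
The recursion can be sketched as follows. To keep the example short and self-contained, the hyper-plane is chosen by an exhaustive search over the perpendicular bisectors of exemplar pairs, which stands in for the conjugate gradient optimization and is not the method of the implementation described here. The sketch also assumes distinct, equiprobable exemplars and isotropic Gaussian noise of standard deviation `sigma`.

```python
import math
from itertools import combinations


def _entropy(probs):
    return -sum(q * math.log2(q) for q in probs if q > 0)


def _signed_distance(plane, p):
    w, w0 = plane
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, p)) + w0) / norm


def _information(plane, points, sigma):
    """Mutual information (bits) between equiprobable points and the plane side."""
    prior = 1.0 / len(points)
    joint = []
    for p in points:
        z = _signed_distance(plane, p) / (sigma * math.sqrt(2.0))
        q = 0.5 * (1.0 + math.erf(z))
        joint += [prior * q, prior * (1.0 - q)]
    p_y1 = sum(joint[0::2])
    return math.log2(len(points)) + _entropy([p_y1, 1.0 - p_y1]) - _entropy(joint)


def _best_plane(points, sigma):
    """Most informative perpendicular bisector of any pair of points."""
    best, best_info = None, -1.0
    for a, b in combinations(points, 2):
        w = [bi - ai for ai, bi in zip(a, b)]
        w0 = -sum(wi * (ai + bi) / 2.0 for wi, ai, bi in zip(w, a, b))
        info = _information((w, w0), points, sigma)
        if info > best_info:
            best, best_info = (w, w0), info
    return best


def build_tree(indices, exemplars, sigma):
    """Recursively split exemplar indices; leaves are ints, nodes are dicts."""
    if len(indices) == 1:
        return indices[0]
    plane = _best_plane([exemplars[i] for i in indices], sigma)
    pos = [i for i in indices if _signed_distance(plane, exemplars[i]) >= 0]
    neg = [i for i in indices if _signed_distance(plane, exemplars[i]) < 0]
    return {"plane": plane,
            "neg": build_tree(neg, exemplars, sigma),
            "pos": build_tree(pos, exemplars, sigma)}


# Hypothetical 1-D exemplars at 0, 1, 2 and 3.
tree = build_tree([0, 1, 2, 3], [[0.0], [1.0], [2.0], [3.0]], 0.1)
```

Because a bisector always separates the pair that defines it, every split produces two non-empty subsets and the recursion terminates with one exemplar per leaf.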

5  Directions  for  Further  Research 

The  ITDT  ATR  architecture  developed  in  the  course  of  this  work  serves  as  a  demonstration 
of  the  direct  applicability  of  notions  from  information  theory  to  realization  and  optimization 
of  ATR  technology.  While  the  prototype  has  proven  to  be  quite  interesting  and  powerful, 
many  directions  for  investigation  and  enhancement  are  open,  including: 

•  implementation  of  2-DOF,  3-DOF, ...,  N-DOF  versions; 

•  extensive testing with “real” data sets with detailed statistical characterization of sensor,
target, environment, variation of several DOFs and high accuracy truthing;

•  development  of  means  to  deal  with  the  size  of  the  decision  tree  storage,  via  specialized 
compression  techniques; 

•  development  of  theory  giving  rise  to  a  metric  function  that  supports  partial  object 
recognition  within  this  framework; 

•  implementation of a distributed network computer based system for the time consuming
off-line computation of the decision tree solution;

•  exploration of the performance and storage requirements associated with an ITDT
which forgoes target orientation analysis (yielding only a target/no-target decision).